Abstract

Background: Quantification of protein expression by means of mass spectrometry (MS) has been introduced in
various proteomics studies. In particular, two label-free quantification methods, spectral counting and
spectral feature analysis, have been extensively investigated in a wide variety of proteomic studies. The cornerstone
of both methods is peptide identification based on a proteomic database search and subsequent estimation of
peptide retention time. However, they often suffer from restrictive database search and inaccurate estimation of
the liquid chromatography (LC) retention time. Furthermore, in conventional peptide identification methods based on
database or spectral library search algorithms such as SEQUEST or SpectraST, neither the best
match nor a high-scoring match is guaranteed to be a true match. Lastly, these methods are limited in the sense that target peptides cannot be
identified unless they have been previously generated and stored in the database or spectral libraries.
To overcome these limitations, we propose a novel method, the Quantification method based on Finding the
Identical Spectral set for a Homogenous peptide (Q-FISH), which estimates a peptide's abundance from its tandem mass
spectrometry (MS/MS) spectra through the direct comparison of experimental spectra. Intuitively, Q-FISH
compares all possible pairs of experimental spectra in order to identify both known and novel proteins,
significantly enhancing identification accuracy by grouping replicated spectra from the same peptide targets.

Results: We applied Q-FISH to Nano-LC-MS/MS data obtained from human hepatocellular carcinoma (HCC) and normal
liver tissue samples to identify differentially expressed peptides between the normal and disease samples. For a total of
44,318 spectra obtained through MS/MS analysis, Q-FISH yielded 14,748 clusters. Among these, 5,777 clusters were
identified only in the HCC sample, 6,648 clusters only in the normal tissue sample, and 2,323 clusters in both the HCC
and normal tissue samples. While it would be interesting to investigate peptide clusters found in only one sample, we
further examined the spectral clusters identified in both the HCC and normal samples, since our goal is to identify and assess
differentially expressed peptides quantitatively. The next step was to perform a beta-binomial test to isolate differentially
expressed peptides between the HCC and normal tissue samples. This test resulted in 84 peptides with significantly
different spectral counts between the HCC and normal tissue samples. We independently identified 50 and 95
peptides by SEQUEST in the HCC and normal samples, of which 24 and 56 peptides, respectively, were found to be known biomarkers for human
liver cancer. Comparing the Q-FISH and SEQUEST results, we found that 22 of the 84 differentially expressed peptides found by Q-FISH
were also identified by SEQUEST. Remarkably, of these 22 peptides discovered by both Q-FISH and SEQUEST, 13 peptides
are known markers for human liver cancer and the remaining 9 peptides are known to be associated with other cancers.

Conclusions: We proposed a novel statistical method, Q-FISH, for accurately identifying protein species and 
simultaneously quantifying the expression levels of identified peptides from mass spectrometry data. Q-FISH analysis on 
human HCC and liver tissue samples identified many protein biomarkers that are highly relevant to HCC. Q-FISH can be 
a useful tool both for peptide identification and quantification on mass spectrometry data analysis. It may also prove to 
be more effective in discovering novel protein biomarkers than SEQUEST and other standard methods.
Background

The main objective of functional proteomics analysis is 
often to estimate changes in the amount of proteins found 
in complex biological systems, in response to physiological 
and clinical factors such as cell development, disease pro- 
gression, or drug treatment. In particular, one of the key 
issues in proteomics research based on tandem mass spec- 
trometry (MS/MS) is the identification of protein species 
and the characterization of their expression changes in 
normal and disease samples. Three analysis techniques are 
often required in an MS/MS study: expressed peptide 
identification, target protein characterization, and quantifi- 
cation [1]. For hundreds to tens of thousands of fragment 
ion spectra generated, the assignment of the fragment ion 
spectra to peptide sequences, the identification of proteins 
represented by each peptide, and the estimation of their 
abundances in the analyzed sample require complex computations
and remain major statistical challenges [2].

Quantification of protein expression using mass spectro- 
metry (MS) is often required for the discovery of protein 
biomarkers associated with cancer, their response to sti- 
muli, cell signalling cascades and the function of cell 
cycle-promoting proteins, and various biomedical investi- 
gations [3]. Two categories of quantification methods for 
MS data have been used: stable isotope labelling quantification
and label-free quantification [2].

Several stable isotope-based quantification methods 
have been introduced based on different labelling 
reagents that can be chemically bound to peptides [4]. It
is, however, difficult to simultaneously quantify the 
amount of proteins/peptides in multiple samples because 
of the limited number of labelling reagents available [5]. 
Moreover, current practical applications can typically
quantify, at most, a few hundred peptides, measuring
relative expression values for each pair of contrasting samples.
Furthermore, the high cost of labelling reagents
makes these quantification methods difficult to apply
routinely to the characterization of the global
proteome.

On the other hand, label-free quantification, which does
not require stable isotope labelling, has the
advantages of low cost and simplicity. Currently, two
label-free methods are available to measure expression
levels of peptides: spectral counting and spectral feature
analysis. The spectral counting method estimates
peptide expression levels by counting spectra
(from MS/MS data) or by estimating the integrated
ion intensities [6,7]. The spectral feature analysis
method quantitatively determines the peptide expression
levels by comparing three-dimensional patterns (retention
time, m/z and intensity) between different samples [8-13].

However, these label-free quantitative methods have
two main shortcomings. The first limitation is the
numerous false-positive discriminative peptides that
result from chromatographic variability between
LC-MS experiments. In spectral feature analysis,
after two candidates with the same MS1
retention time and m/z are found, the difference in their MS1
intensities is used to define the peptide levels. Therefore,
spectral feature analysis requires stringent reproducibility
[3,8] and additional pre-processing such as LC normalization
or retention time alignment [14,15].

The second limitation is that spectral counting cannot be
performed without peptide identification, because relative
peptide levels can be quantified only after the peptides
have been identified. In peptide identification, MS/MS spectra
are assigned using a database search algorithm or a
spectral library search algorithm such as SEQUEST,
MASCOT, or SpectraST. Specifically, database search
algorithms calculate score functions to compare the
experimental MS/MS spectra with theoretical MS/MS
spectra of peptides derived from protein sequence databases.
The pool of theoretical MS/MS spectra is restricted
by user-specified criteria such as mass tolerance, proteolytic
enzymes, and the types of post-translational modification
[2,16]. A number of spectra may not be assigned
to the correct peptides for diverse reasons, including
deficiencies of the scoring scheme implemented in the
database search tools, sequence variations (e.g., single
nucleotide polymorphisms, SNPs), omissions in the
database searched, post-translational or chemical modifications
of the peptide analyzed, and the observation of
genomic sequences that are not anticipated (e.g., splice
forms, somatic rearrangement, and processed proteins)
[17]. For all these reasons, a large number of important
peptides may be lost during the database search.

Instead of matching acquired MS/MS spectra against
theoretically predicted spectra, MS/MS spectra can also be
assigned to peptides by matching them against a spectral
library. The spectral library is compiled from a large collection
of experimentally observed MS/MS spectra identified
in previous experiments [18]. Generally, a set of
spectra of known peptide sequences is collected into a
library and used as a reference. An experimental spectrum
may be identified by a similar match in the library.
However, a spectrum can be identified by this method only if
a matching spectrum was observed previously and entered into
the library. Library search methods are therefore well suited
for targeted proteomics, in which one seeks not to discover
previously unseen peptides but rather to find and
quantify expected peptides of interest in the sample
[19].

To overcome these limitations of label-free quantification
methods, we propose a novel spectral counting
method that estimates a peptide's abundance by counting
MS/MS spectra after comparing and clustering all experimentally
observed spectra. This approach has several advantages.
First, because the same peptide may be fragmented
multiple times or repeatedly observed at different time
points in an MS/MS run, multiple spectra may be
acquired for the same peptide. In other words, duplicated
spectra are ubiquitous in large-scale proteomics
data [20]. Our method thus attempts to identify and
group all the duplicate spectra, which allows us to quantify
the amount of peptide found in complex biological
systems without searching through the databases or
using LC normalization.

For the given spectra, our method, referred to as the
Quantification method derived by Finding the Identical
Spectra set for a Homogenous peptide (Q-FISH), employs a
two-stage clustering algorithm to determine whether they
are from the same peptide with homogeneous spectral
patterns. The Q-FISH algorithm employs two similarity
measures: the difference between the precursor ions of two
spectra and the correlation coefficient of their moving window averages.
Subsequently, the algorithm clusters spectra from the
same peptide through all plausible pair-wise comparisons.
By counting the spectra in each cluster, we
can estimate the amount of each peptide. Figure 1 summarizes
the workflow of the proposed Q-FISH algorithm.
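
The pairwise comparison described above can be illustrated with a minimal Python sketch. It is not the authors' two-stage procedure: it applies the two similarity measures in a single pass and merges similar spectra with a simple single-linkage rule. The dictionary keys (`sample`, `precursor_mz`, `intensities`), the thresholds `mz_tol` and `min_corr`, and the window size are hypothetical choices made only for illustration, and the spectra are assumed to be already binned onto a common m/z grid.

```python
from collections import Counter
from math import sqrt


def moving_average(values, window):
    """Moving-window average over a binned intensity vector."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def similar(s1, s2, mz_tol=1.0, min_corr=0.9, window=3):
    """Apply both similarity measures (hypothetical thresholds)."""
    if abs(s1["precursor_mz"] - s2["precursor_mz"]) > mz_tol:
        return False
    r = pearson(moving_average(s1["intensities"], window),
                moving_average(s2["intensities"], window))
    return r >= min_corr


def cluster_spectra(spectra):
    """Single-linkage clustering (union-find) over all pairwise comparisons."""
    parent = list(range(len(spectra)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(spectra)):
        for j in range(i + 1, len(spectra)):
            if similar(spectra[i], spectra[j]):
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(spectra))]


def spectral_counts(spectra, labels):
    """Number of spectra per (cluster label, sample) pair."""
    return Counter((lab, s["sample"]) for s, lab in zip(spectra, labels))


# Toy input: two near-identical spectra and one unrelated spectrum.
toy = [
    {"sample": "HCC", "precursor_mz": 500.2,
     "intensities": [0, 10, 50, 10, 0, 5, 80, 5]},
    {"sample": "normal", "precursor_mz": 500.3,
     "intensities": [0, 12, 48, 9, 0, 6, 78, 4]},
    {"sample": "HCC", "precursor_mz": 650.9,
     "intensities": [70, 5, 0, 0, 30, 0, 0, 10]},
]
toy_labels = cluster_spectra(toy)
toy_counts = spectral_counts(toy, toy_labels)
```

In this toy input the first two spectra end up in one cluster with one spectrum from each sample, while the third spectrum forms a cluster of its own because its precursor differs by more than the tolerance.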

Our proposed algorithm was applied to identify differentially
expressed peptides in real data obtained
from a Nano-LC-MS/MS experiment performed on
human HCC and normal liver tissue samples.

Results & Discussion 

We introduced and tested the Q-FISH algorithm
to identify and quantify all expressed peptides
from an MS/MS dataset by clustering and counting
spectra with homogeneous spectral patterns. In order to
test our algorithm, we performed a Nano-LC-MS/MS
experiment with triplicate human hepatocellular carcinoma
and normal liver tissue samples. For a total of
44,318 MS/MS spectra obtained through three MS/MS
analyses of each of the two samples, Q-FISH yielded 14,748 clusters.
More specifically, 5,777 clusters were identified only in the
hepatocellular carcinoma (HCC) sample, 6,648 clusters
only in the normal sample, and 2,323 clusters in both
HCC and normal samples. For the purpose of comparison,
we also ran SEQUEST and SpectraST to identify
peptides. However, SEQUEST identified only 4,824 of the
44,318 spectra, corresponding to a total of 1,326 peptides.
Generally, most database search
algorithms, including SEQUEST, assign experimental
spectra to peptides by comparing the experimental data
with theoretical spectra generated from peptide
sequences. It should be noted that neither the best match
nor a high search score is necessarily a true match, especially
for novel protein targets. Therefore, many peptides
could be misidentified, or not identified at all, unless
they were previously generated and stored in the sequence
database. In our experiments, a large number of
experimental spectra (89.12%, namely 39,494 of a total of
44,318 spectra) could not be used for peptide identification
using SEQUEST. SpectraST identified 5,549 spectra
and 3,295 peptides. That is, a large number of spectra
(87.48%, namely 38,769 of a total of 44,318 spectra) still
could not be used for peptide identification by SpectraST.
In contrast, our proposed method directly compares all observed
experimental spectra to discover differentially expressed
peptides without a loss of observed spectra.

Figure 1 Work flow chart. This figure shows a flow schematic of
the analysis process performed by the Q-FISH algorithm.

In Figure 2, the standardized intensities of the experimental
spectra are plotted as positive intensity
values (upper part) and the reference spectrum is plotted
as negative intensity values (lower part). Specifically,
Figure 2(a), which illustrates an example of one cluster
with nine similar spectra, shows the spectral patterns of the
MS/MS spectra as well as the reference spectrum for the clustered
spectral set. The overall patterns look quite similar,
and all nine spectra seem to have almost identical
patterns. Table 1 shows the search results returned by
SEQUEST and SpectraST. In the case of
spectral set S366006, all nine spectra were assigned
the same peptide sequence, "SIFSAVLDELK", by both
SEQUEST and SpectraST, with XCorr above 1.97. In
addition, the reference spectrum for the clustered spectral
set was identified as the peptide sequence "SIFSAVLDELK"
with a SEQUEST score of XCorr = 2.96. This analysis
reveals that these spectra can be regarded as the
spectra of a homogeneous peptide. In other words, each
cluster can be expected to be composed of spectra from
the same peptide.

Figure 2 Pattern-plots of the reference spectrum and experimental MS/MS spectra in clustered spectral sets. This figure shows pattern-plots
of the experimental MS/MS spectra, plotted using positive intensities (upper part), and the reference spectrum, plotted using negative intensities
(lower part). (a) All nine spectra were identified as the same peptide; (b) two of the eleven spectra were not identified by SEQUEST;
and (c) only four of the seven spectra were identified by SpectraST, although the pattern-plots are very similar.

Similarly, Figures 2(b) and 2(c) show spectral patterns
for the reference spectrum and the experimental spectra
of a single cluster. It should be noted that the overall patterns
look quite similar and all spectra pairs are characterized
by high correlation coefficients. However, while all
spectra in S1157004 could be identified by SpectraST, two
of the eleven spectra could not be identified by
SEQUEST, as shown in Table 1. Conversely, all spectra
in S65002 were identified by SEQUEST with high scores,
while three spectra could not be identified by SpectraST.
In other words, if we relied only on conventional
peptide identification tools such as SEQUEST or SpectraST,
these spectra would have been excluded despite their similar
peak patterns. In contrast, our Q-FISH algorithm
was able to include these spectra without a loss of
information.

Table 1 Results of SEQUEST & SpectraST for spectra in clustered spectral sets
These spectra were clustered by the proposed Q-FISH algorithm. In the case of spectral set S366006, all spectra in the set were identified as the same peptide
sequence, "SIFSAVLDELK", by both SEQUEST and SpectraST, while two spectra in S1157004 were not identified by SEQUEST (XCorr < 2.11). Also, all spectra in
S65002 were identified by SEQUEST with high scores, while only four spectra were identified by SpectraST. If we relied only on SEQUEST or SpectraST, these
spectra in S1157004 or S65002 would be excluded.

In this study, we were interested in identifying proteins
and characterizing their differential expression in normal
and HCC samples. Hence, we first focused on the 2,323
clusters that were observed in both samples. Figure 3
and Table 2 show a scatter plot and a correlation matrix,
respectively, of the number of spectra in the same cluster
obtained through the replicated experiments on
HCC and normal tissue samples. It is worth
noting that the number of spectra in the same cluster
showed high correlations within replicated samples
(0.7178-0.8315), while the number of spectra between
different samples showed weak correlations
(0.0654-0.1549). For a given spectral set, the
reference spectrum was estimated by averaging the relative
intensities of the spectra. Consequently, each reference
spectrum is associated with the number of spectra observed
in the normal and HCC samples. We computed the false
clustering rate (FCR) on the 2,323 clusters shared by the
HCC and normal samples. Among these clusters, 1,571
clusters had FCRs smaller than 0.05. Our next step was to
perform a beta-binomial test to isolate differentially
expressed peptides (DEPs) [21]. The result showed that
only 84 of the 1,571 reference spectra were characterized
by significantly different spectral counts between the HCC and
normal tissue samples. In addition, Q-FISH found 5,777 clusters
only in the HCC sample and 6,648 clusters only in the
normal sample. Among these clusters, 1,571
and 1,556 clusters, respectively, had FCRs smaller than
0.05.
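
As a rough illustration of the model behind a beta-binomial test, the following Python sketch evaluates the beta-binomial probability mass function and the joint log-likelihood of replicate spectral counts. It shows only the distribution, not the fitting or testing procedure of [21]; a test of this kind typically compares the likelihood of a model with shared parameters for both groups against one with group-specific parameters. The function names and the parameters `alpha` and `beta` are assumptions introduced for the example.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def betabinom_logpmf(k, n, alpha, beta):
    """log P(K = k) for K ~ BetaBinomial(n, alpha, beta)."""
    return (log_choose(n, k)
            + log_beta(k + alpha, n - k + beta)
            - log_beta(alpha, beta))


def log_likelihood(counts, totals, alpha, beta):
    """Joint log-likelihood of replicate counts k_i out of totals n_i."""
    return sum(betabinom_logpmf(k, n, alpha, beta)
               for k, n in zip(counts, totals))


# Sanity check: the probabilities over k = 0..n sum to one.
pmf_total = sum(exp(betabinom_logpmf(k, 10, 2.0, 5.0)) for k in range(11))
```

Here `counts` would hold the number of spectra of one cluster in each replicate and `totals` the total number of spectra in that replicate; working in log space with `lgamma` avoids overflow for large totals.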

Figure 3 Scatter plot between different samples and within replicated samples. This figure shows the scatter plot of the number of
spectra in clustered sets obtained through the replicated experiments on HCC and normal tissue samples. The two black boxes
show the relationships between the numbers of spectra within the replicated HCC and normal samples, while the gray box shows the relationships between
the numbers of spectra in clustered sets between HCC and normal samples.

In order to compare the performance of Q-FISH with
the spectral counting method based on SEQUEST, we used the
human liver data and validated the results through a literature
search. For the human liver data, Q-FISH provided
1,571 differentially expressed clusters for the HCC sample and
1,556 for the normal sample, among which 57 and 99 clusters
were identified by SEQUEST in the HCC and normal samples,
respectively. SEQUEST, in turn, provided 93
and 145 peptides for the HCC and normal tissue samples,
respectively. Among the 57 identified clusters in the HCC
sample, 37 clusters were found to be over-expressed only by
Q-FISH, and 20 peptides/clusters were found by both Q-FISH
and SEQUEST; 73 peptides were
identified only by SEQUEST. In the normal sample, 49
peptides/clusters were identified as over-expressed by both
Q-FISH and SEQUEST, and 50 and 96 peptides/clusters
were identified as over-expressed only by Q-FISH
and only by SEQUEST, respectively.
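
The overlap counts in this comparison can be re-derived from the totals and the shared counts. The short Python check below does this arithmetic; the helper name `split_overlap` is hypothetical and is used only for illustration.

```python
def split_overlap(n_qfish, n_sequest, n_shared):
    """Return (only Q-FISH, shared, only SEQUEST) counts."""
    return n_qfish - n_shared, n_shared, n_sequest - n_shared


# HCC sample: 57 Q-FISH clusters, 93 SEQUEST peptides, 20 shared.
hcc_split = split_overlap(57, 93, 20)
# Normal sample: 99 Q-FISH clusters, 145 SEQUEST peptides, 49 shared.
normal_split = split_overlap(99, 145, 49)

print(hcc_split)     # (37, 20, 73)
print(normal_split)  # (50, 49, 96)
```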

We compared the two sets of results through a literature search. We
assumed that a peptide is a true match if it was reported
in the previous cancer literature. While there is a certain
degree of uncertainty in reported protein biomarkers, this
assumption is not biased toward either of the two methods and
allowed us to compare their performance statistically. For
example, alpha-2-macroglobulin (A2M), annotated by
"VSVQLEASPAFLAVPVEK", was reported to be over-expressed
in HCC samples [22]. This peptide was found to
be over-expressed by Q-FISH, but under-expressed by
spectral counting analysis with SEQUEST. The full list of
peptides is given in Additional file 1. Based on this comparison,
2x2 confusion tables can be constructed, as shown in
Table 3.

Table 2 Correlation matrix and the number of shared
spectral clusters between different samples and within
replicated samples
a: correlation coefficient, b: number of spectral clusters, and c: number of shared spectral
clusters.

This table shows the correlation matrix of the number of spectra in the same cluster
between different samples and within replicated samples. The number of
spectra in the same cluster within replicated samples showed high
correlations, while the number of spectra between different samples showed
weak correlations.

For the Q-FISH result, 65 peptides were found in the literature:
31 for the HCC sample and 34 for the normal sample. Among the 31
peptides for the HCC sample, 25 are reported as over-expressed
in the literature, and are assumed to be correctly
identified. Among the 34 peptides for the normal sample, 17 are
reported as under-expressed in the literature, and thus are
assumed to be correctly identified. The remaining 6 and 17
peptides, respectively, are assumed to be incorrectly identified.

For the SEQUEST result, 93 peptides were reported in the
literature: 43 for the HCC sample and 50 for the normal sample.
Among them, 34 and 24 peptides, respectively, were correctly
identified, while 9 and 26 peptides were incorrectly identified.
Based on these numbers, an accuracy measure was computed,
showing that Q-FISH (accuracy = 64.62%) has
slightly higher accuracy than SEQUEST (accuracy =
62.37%). This comparison showed that Q-FISH performed
as reliably as SEQUEST, despite the comparison
giving SEQUEST a natural advantage.

Table 3 2x2 tables for literature search results of
Q-FISH and SEQUEST

                             Q-FISH                   SEQUEST
Literature             HCC   Normal   Total      HCC   Normal   Total
Over-expressed          25       17      42       34       26      60
Under-expressed          6       17      23        9       24      33
Total                   31       34      65       43       50      93
Accuracy                             64.62%                    62.37%

We assume that a peptide reported in the previous literature is
correctly identified. We compared the two results (Q-FISH and SEQUEST)
through a literature search. Based on this comparison, these 2x2 tables were
constructed.
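
The two accuracy values can be reproduced from the counts reported for the literature search. In the Python sketch below, a call is treated as correct when a peptide assigned to the HCC sample is reported as over-expressed in the literature, or a peptide assigned to the normal sample is reported as under-expressed. The dictionary layout and the function name `accuracy` are assumptions made for the example.

```python
def accuracy(table):
    """Percentage of correct calls in a 2x2 literature table."""
    correct = table["HCC"]["over"] + table["Normal"]["under"]
    total = sum(sum(column.values()) for column in table.values())
    return 100.0 * correct / total


q_fish = {"HCC": {"over": 25, "under": 6},
          "Normal": {"over": 17, "under": 17}}
sequest = {"HCC": {"over": 34, "under": 9},
           "Normal": {"over": 26, "under": 24}}

q_fish_acc = accuracy(q_fish)    # 42 correct of 65
sequest_acc = accuracy(sequest)  # 58 correct of 93
print(f"{q_fish_acc:.2f}% {sequest_acc:.2f}%")  # 64.62% 62.37%
```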

Table 4 provides a list of potential protein biomarkers. Q
scores were calculated by averaging the correlation coefficients
between the moving averages of the reference spectrum
and those of the experimental spectra in the clustered spectral
set. A relatively high value indicates that the reference
spectrum represents the clustered spectral set well.
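
A minimal Python sketch of this score is given below, under the assumption that all spectra in a cluster are binned onto a common m/z grid and expressed as relative intensities. The reference spectrum is taken as the bin-wise mean of the cluster, as described earlier in the text; the window size and the function names are hypothetical.

```python
from math import sqrt


def moving_average(values, window):
    """Moving-window average over a binned intensity vector."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def reference_spectrum(cluster):
    """Bin-wise mean of the relative intensities in a cluster."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def q_score(cluster, window=3):
    """Mean correlation between the smoothed reference and each spectrum."""
    ref = moving_average(reference_spectrum(cluster), window)
    corrs = [pearson(ref, moving_average(s, window)) for s in cluster]
    return sum(corrs) / len(corrs)


# Toy cluster of three similar relative-intensity vectors.
toy_cluster = [
    [0.0, 0.1, 0.6, 0.1, 0.0, 0.1, 1.0, 0.1],
    [0.0, 0.2, 0.6, 0.1, 0.0, 0.1, 1.0, 0.0],
    [0.1, 0.1, 0.5, 0.1, 0.0, 0.0, 1.0, 0.1],
]
toy_q = q_score(toy_cluster)
```

A cluster of identical spectra gives a score of 1, and the score decreases as the member spectra diverge from their average pattern.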

To find the potential biomarkers in each sample, we
searched the reference spectra of clusters using SEQUEST.
Consequently, we found 50 and 95 peptides as
candidate biomarkers in the HCC and normal samples,
respectively, as shown in Table 4. Among them, 24 peptides
in the HCC sample and 56 peptides in the normal sample
are known biomarkers for human liver cancer. Also,
22 reference spectra among the 84 DEPs were identified by
SEQUEST. Among them, 13 peptides are also known markers
for human liver cancer.

As shown in Table 4, carbamoyl-phosphate synthetase
1 (CPS1) is annotated by various sequences such as
"MEYDGILIAGGPGNPALAEPLIQNVR", "SIFSAVLDELK",
"TAVDSGIPLLTNFQVTK" and "GLNSESMTEETLK".
These sequences are underexpressed in the
HCC sample. Kinoshita et al. [23] performed differential
gene display analysis (DGDA) to compare the intensities
of polymerase chain reaction (PCR) products and evaluated
the degrees of mRNA expression in HCC tissue
samples and noncancerous hepatitis tissues. They
confirmed that CPS1 is underexpressed.
Specifically, CPS1 synthesizes carbamyl phosphate
from bicarbonate, adenosine triphosphate (ATP) and
ammonia. A genetic mutation of CPS1 was identified
as the source of hyperammonemia. In HCC tissue samples,
underexpression of the CPS1 gene had been
reported in rats, but their study was the first to
report such a finding in humans [23]. Heterogeneous
nuclear ribonucleoprotein C (HNRNPC), annotated as
"MIAGQVLDINLAAEPK", and actin, cytoplasmic 1
(ACTB), annotated as "DLYANTVLSGGTTMYPGIADR",
were found to be over-expressed in the HCC sample
[24,25]. In contrast, glutathione S-transferase
(GSTA1), annotated as "NDGYLMFQQVPMVEIDGMK",
has been reported to be down-regulated in human HCC samples
[26]. Moreover, fatty acid-binding protein (FABP1), annotated
as "SVTELNGDIITNTMTLGDIVFK", and Isoform
1 of Liver carboxylesterase 1 (CES1), annotated as
"EGYLQIGANTQAAQK", are both characteristic of the HCC
sample [27,28]. 
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Table 4 Lists of differentially expressed peptides in HCC and normal sample. 



HCC sample 



R^l atari 


Gene 


Shocjun Sequence 








rlOItrin INdlTlc 




Cancer 


Name 








Score 






i— \rc 
HLL 


A l/D 1 Din 

AI\K lulU 


IVblNIUVrUrK 


2 


2.04 


0.95 


Aldo-keto reductase family 1 member 
R1 n 

L) I U 


ZU38884D 




A 1 R 
■■-\Lb 


n\/Fi ^'^/lFl vpvar 

UvTIAjlVIrL Y ti An 


-> 
z 


i i~\a 

zi.U- 1 


u.yo 


Putative uncharacterized protein ALB 


Tn^Q^3£ 




ECHDC3 


\/inr ft rrn\ /rrrn \r~\\ \/ 

VlllbAEGPVFSSGHDLK 


2 


2.14 


0.95 


Isoform 1 of Enoyl-CoA hydratase 


21 495032 














nnma i n-r-nnt^in i nn t~i("f~itpin 3 
UUI 1 Id 1 1 1 l-Ullldllllliy kJIULCMI J, 
















mitochondria 






EEF1 A2 


THIMIWIGHVD^GK 

II IMNIVVIvJI IV UJUI\ 


3 


2 29 


0 83 


FL~innatir~in T^rtr^r 1-3 nna 9 
LIUI lUdllUI 1 dLLUI 1 o 1 Id Z 


IOIUI UJU 




FFF9 


AVI P\/MF^Ff^FTAni R 


■D 

j 


J.jU 


n 

u.yo 


Elongation factor 2 


1 Q 1 f. 1 QAC] 

1 o 1 O i yHU 






FTA^Ar;in\A/r^nni ta/tmpk" 

r 1 AjAuIL>VVuUlJL 1 V 1 INr r\ 


33 

J J 


Z.J I 


u.o 


soform alpha-enolase of Alpha- 


I oo I J/OJ 














enolase 






FGG 


EGFGHLSPTGTFEFWLGNEK 


2 


3.16 


0,95 


isoform Gamma-B of Fibrinogen 


19596924 














gamma chain 






FN 1 


SSPWIDASTAIDAPSNLR 


2 


2.45 


0.96 


soform 1 of Fibronectin 


16820872 




FTCD 


EAQELSLPWGSQLVGLVPLK 


2 


2.98 


0.99 


soform A of Formimidoyltransferase- 


18571811 














cyclodeaminase 






GAPDH 


WGDAGAEYWESTGVFTTMEK 


5 


3.57 


0.96 


Glyceraldehyde-3-phosphate 


20714864 














dehydrogenase 






HBD 


FFESFGDLSSPDAVMGNPK 


2 


2.37 


0.96 


Hemoglobin subunit delta 


9214599 




HM0X1 


ALDLPSSGEGLAFFTFPNIASATK 


2 


2.82 


0.90 


Heme oxygenase 1 


20664735 




HRSP12 


E 1 E AVAI QG P LTTAS L 


2 


2.31 


0.98 


Ribonuclease UK1 14 


18349270 




HSPA5 


NQLTSNPENTVFDAK 


A 


2.51 


0.97 


HSPA5 protein 


19445531 




HSPA9 


VINEPTAAALAYGLDK 


2 


2.04 


0.93 


Stress-70 protein, mitochondrial 


18334731 






DIVMTQSPDSLAVSLGER 


2 


2.52 


0.99 








njrU I 


ai mi nr~\/ni i AnAVAVATMr^Dk' 


J 


Z.DD 


u.yj 


60 kDa heat shock protein, 


z i d jjooy 














mitochondria) 






NME1 


VMLGETNPADSKPGTIR 


2 


2.57 


0.97 


soform 1 of Nucleoside diphosphate 


1 7594820 














kinase A 








EISLWFKPEELVDYK 


2 


2.27 


0.95 








P4HB 


VDATEESDLAQQYGVR 


2 


2.38 


0.81 


Protein disulfide-isomerase 


21207424 




PRDX6 


LIALSIDSVEDHLAWSK 


3 


3.48 


0.93 


Peroxiredoxin-6 


19893992 




TKT 


ILATPPQEDAPSVDIANIR 


3 


2.16 


0.98 


cDNA FU54957, highly similar to 


17321041 














Transketolase 






VCP 


LIVDEAINEDNSWSLSQPK 


2 


2.49 


0.98 


Transitional endoplasmic reticulum 


1 2560433 














ATPase 






VIM 


EMEENFAVEAANYQDTIGR 


3 


3.28 


0.99 


Vimentin 


19843643 


breast cancer 


EEF1D 


SLAGSSGPGASSGTSGDHGELWR 


2 


3.17 


0.93 


Elongation factor 1 -delta 


1 7997862 




HBB 


GTFATLSELHCDK 


2 


2.09 


0.97 


Hemoglobin subunit beta 


20097481 


colon cancer 


ACTN1 


GYEEWLLNEIR 


3 


2.03 


0.99 


Alpha-actinin-1 


17898132 






ACLISLGYDVENDR 


2 


2.09 


0.94 








ATP5B 


DQEGQDVLLFIDNIFR 


2 


2.58 


0.98 


ATP synthase subunit beta, 


20080835 














mitochondria 






HMGCS2 


LMFNDFLSASSDTQTSLYK 


3 


2.87 


0.93 


Hydroxymethylglutaryl-CoA synthase, 


16940161 














mitochondria 




colorectral cancer 


ATP5A1 


NVQAEEMVEFSSGLK 


2 


2.65 


0.95 


ATP synthase subunit alpha, 


9261598 














mitochondria 








EVAAFAQFGSDLDAATQQLLSR 


3 


2.88 


0.87 






Leukemia 


IDH1 


SIEDFAHSSFQMALSK 


2 


2.53 


0.97 


Isocitrate dehydrogenase [NADP] 


21205756 














cytoplasmic 




pancreatic cancer 


EPPK1 


LLEAQIATGGVIDPVHSHR 


2 


2.64 


0.97 


epiplakin 1 


18498355 


lung cancer 


FGB 


DNENWNEYSSELEK 


3 


2.57 


0.97 


Fibrinogen beta chain 


20142248 


cell migration. 


FLNB 


YAPTEVGLHEMHIK 


2 


2.02 


0.97 


soform 1 of Filamin-B 


20110358 




XRGC5 


YAPTEAQLNAVDALIDSMSLAK 


5 


3.60 


0.94 


ATP-dependent DNA helicase 2 





subunit 2 
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Table 4 Lists of differentially expressed peptides in HCC and normal sample. (Continued) 



API B1 


LAPPLVTLLSAEPELQYVALR 


2 


2.81 


0.99 


soform A of AP-1 complex subunit 
beta-1 




PLEC 


AGTLSITEFADMLSGNAGGFR 


2 


2.16 


0.89 


Isoform 1 of Plectin-1 




SDHAF2 


PAPEIFENEVMALLR 


3 


2/1 1 


0.93 


Protein EMI5 homolog, mitochondrial 




TUBA4A 


AFVHWYVGEGMEEGEFSEAR 


2 


2.40 


0.98 


Tubulin alpha-4A chain 






AVFVDLEPTVIDEVR 


2 


2.23 


0.98 






TYMP 


r\\ rr a "n /Pic i n rrACll ci/ 
UVIAI VUSLPL ASILSK 


3 


2.84 


0.93 


Thymidine phosphorylase 




UGP2 


Ti r*tf~r~i Mvirii ctax/i^ a a 11/ 
I LUbbLIMVIULt I AVbAAlK 


2 


2.98 


0.94 


Isoform 1 of UTP-glucose-1 - 
phosphate uridylyltransferase 




TUBB 


AILVDLEPGTMDSVR 


2 


1.97 


0.95 


Tubulin beta chain 




TPI 


VFNGAFTGEISPGMIK 


2 


2.52 


0.95 


Triosephosphate isomerase 
(Fragment) 




Unknown 


LFIGGLSFETTEESLR 


2 


2.64 


0.97 


Putative uncharacterized protein 
HNRNPA2B1 






SVPTSTVF Y PS DGVATE K 


3 


2.77 


0.93 


cDNA FU54957, highly similar to 
Transketolase 






RHVFGESDELIGQK 


2 


2.09 


0.96 








VFSNGADLSGVFEEAPLK 


2 


2.24 


0.90 


PR02275 




Normal sample 


ncldLcu LdNLct le 

Name 


Qh^fii in son anro 
JllULJUM JcLjUcMLfc: 


b 


XCorr 


n 

Score 


rlUlfcrIM l\dl 1 1C 


PMID C 


HCC A2M 


VSVQLEASPAFLAVPVEK 


2 


2.36 


0.93 


Alpha-2-macroglobu in 


18959789 




LLLQQVSLPELPGEYSMK 


3 


2.25 


0.96 






ACTA2 


YPIEHGIITNWDDMEK 


3 


2.42 


0.96 


Actin, aortic smooth muse e 


21214675 


ALB 


RPCFSALEVDETYVPK 


2 


2.18 


0.90 


Putative uncharacterized protein ALB 


20658536 


ALDH2 


VAEQTPLTALYVANLIK 


2 


2.55 


0.86 


Aldehyde dehydrogenase, 
mitochondrial 


20186752 


ALDH6A1 


ENTLNQLVGAAFGAAGQR 


2 


2.46 


0.89 


Methylmalonate-semialdehyde 
dehydrogenase [acylating], 
mitochondrial 


1 7786358 




LFIHESIHDEWNR 


2 


2.61 


0.96 








VNAGDQPGADLGPLITPQAK 


2 


3.27 


0.98 






ALDOB 


GILAADESVGTMGNR 


3 


2.40 


0.85 


Fructose-bisphosphate aldolase B 


1 7786358 




ELSEIAQSIVANGK 


2 


2.32 


0.96 






ASL 


INVLPLGSGAIAGNPLGVDR 


3 


3.18 


0.76 


Argininosuccinate lyase 


19138817 


ASS1 


NPWSMDENLMHISYEAGILENPK 


2 


2.74 


0.96 


Argininosuccinate synthase 


20104527 


BHMT 


ISGQEVNEAACDIAR 


2 


2.23 


0.62 


Betaine-homocysteine S- 
methyltransferase 1 


19960509




AGPWTPEAAVEHPEAVR 


2 


2.62 


0.93 






C5orf33 


VATQAVEDVLNIAK 


2 


2.23 


0.97 


Isoform 2 of UPF0465 protein 
C5orf33 


21495032 


CAT 


GAGAFGYFEVTHDITK 


2 


2.17 


0.78 


Catalase 


21324921 




FNTANDDNVTQVR 


2 


2.40 


0.92 






FGG 


AIQLTYNPDESSKPNMIDAATLK 


3 


3.76 


0.92 


Fibrinogen gamma chain 


17018627 


ETFA 


LEVAPISDIIAIK 


5 


2.73 


0.89 


Electron transfer flavoprotein alpha- 
subunit 


20515076 


CPS1 


TVLMNPNIASVQTNEVGLK 


3 


2.42 


0.99 


Isoform 1 of Carbamoyl-phosphate
synthase [ammonia], mitochondrial 


12143053 




FLGVAEQLHNEGFK 


3 


2.67 


0.97 








AVNTLNEALEFAK 


2 


2.58 


0.96 








VLGTSVESIMATEDR 


3 


2.22 


0.88 








IEFEGQPVDFVDPNK 


2 


2.52 


0.98 








GLNSESMTEETLK 


2 


2.63 


0.95 






CYP3A7 


EMVPIIAQYGDVLVR


2 


2.37 


0.80 


Cytochrome P450 variant 3A7 


17978482
Table 4 Lists of differentially expressed peptides in HCC and normal sample. (Continued)



DCI 


DADVQNFVSFISK 


3 


2.20


0.99


Isoform 1 of 3,2-trans-enoyl-CoA


1903293












isomerase, mitochondrial 




ECHS1 


ALNALCDGLIDELNQALK 


2 


3.33


0.98


Enoyl-CoA hydratase, mitochondrial


15492826


EIF5 


AMGPLVLTEVLFNEK 


5 


2.11


0.83 


Eukaryotic translation initiation factor 
5 


19175833 


FBP1 


LDVLSNDLVMNMLK


7


2.34


0.72


Fructose-1,6-bisphosphatase 1


19637194 


FH 


SGLGELILPENEPGSSIMPGK 


3 


2.20


0.98


Isoform Mitochondrial of Fumarate


1958270












hydratase, mitochondrial 






AAAEVNQDYGLDPK 


3 


2.23


0.97 








IYELAAGGTAVGTGLNTR 


2 


2.24


0.97 






FLNA 


ASGPGLNTTGVPASLPVEFTIDAK 


3 


2.68 


0.97 


Isoform 2 of Filamin-A


21471709 


HPD 


SQIQEYVDYNGGAGVQHIALK 


2 


2.99 


0.98 


4-hydroxyphenylpyruvate


8558370 












dioxygenase 




HSPA5 


SQIFSTASDNQPTVTIK


2 


2.16 


0.97 


HSPA5 protein 


19445531 


KRT8 


LKLEAELGNMQGLVEDFK 


59 


2.08 


0.43 


Keratin, type II cytoskeletal 8 


18932288 


PBLD 


VNTENLLQVENTGK 


2 


2.33 


0.94 


Phenazine biosynthesis-like domain- 


20525558 












containing protein 




PDIA4 


EVSQPDWTPPPEVFLVLTK


3 


2.49 


0.98 


Protein disulfide-isomerase A4 


19016532 


PEBP1 


GNDISSGTVLSDYVGSGPPK 


6 


3.51 


0.96 


Phosphatidylethanolamine-binding 


20739083 












protein 1 




PHB 


NITYLPAGQSVLLQLPQ 


3 


2.56 


0.86 


Prohibitin 


21318481 


PRDX6 


ELAILLGMLDPAEK 


.-! 


2.00 


0.94 


Peroxiredoxin-6 


19893992


cpl PMRD1 
jcLclNDr I 


MTf^TFAPnVI ATA/nx/rtPk" 


z 


Z.UO 




cuinm tljjj/j/, [uglily similar 10 


Z I JJO/ I D 












5g|gnjyp^-Qjpfjjpg protein 1 




cpiRR<;i 
jUnDj I 


I TP\/n\/l FVl^FAIAk - 


9 
Z 


z.ul 


1 on 
.uu 


Isoform 9 of Sorbin and SH3 domain-














containing protein 1 




SORD 


LENYPIPEPGPNEVLLR 


2 


1.99 


0.97 


Sorbitol dehydrogenase 


12848999 


STIP1 


ALSVGNIDDALQCYSEAIK 


2 


2.54 


0.97 


Stress-induced-phosphoprotein 1 


17627933 


TPI1 


VAHALAEGLGVIACIGEK 


2 


3.35 


0.99 


Isoform 2 of Triosephosphate


18813785 












isomerase 




TXNDC5 


ALAPTWEQLALGLEHSETVK 


3 


4.01 


0.98 


Thioredoxin domain-containing 


16574106 












protein 5 




Vk3 


EIVLTQSPATLSLSPGER 


2 


2.97 


0.97 


Rheumatoid factor D5 light chain 


15207089 












(Fragment) 




ADH1A 


FSLDALITHVLPFEK 


6 


2.60 


0.92 


Alcohol dehydrogenase 1A 


16054971 




ELGATECINPQDYK 


2 


2.15 


0.94 






ADH4 


ISEAFDLMNQGK 


A 


2.95 


0.94 


Isoform 2 of Alcohol dehydrogenase 
4 


16054971 




GGVDFALDCAGGSETMK 


3 


3.25 


0.96 








FNLDALVFHTLPFDK 


8 


2.76 


0.95 








AAIAWEAGKPLCIEEVEVAPPK 


3 


2.71 


0.99 








DLHKPIQEVIIELTK 


5 


3.08 


0.99 







prostate cancer


COL6A2


YGGLHFSDQVEVFSPPGSDR 


2 


2.33 


0.86 


Isoform 2C2A' of Collagen alpha-2(VI)
chain 


18353764 




LLTPITTLTSEQIQK 


3 


2.57 


0.93 








VAWTYNNEVTTEIR 


5 


2.38 


0.67 








IEDGVPQHLVLVLGGK 


2 


2.01 


0.86 






RPS27A 


TITLEVEPSDTIENVK 


2 


2.23 


0.98 


ubiquitin and ribosomal protein S27a 
precursor 


15647830



breast cancer 


EMILIN1 


LVGSGLHTVEAAGEAR 


2 


2.47 


0.96 


EMILIN-1 


16243817 




MYH9 


NLPIYSEEIVEMYK 


2 


2.06 


0.97 


Isoform 1 of Myosin-9 


18796164 






QLLQANPILEAFGNAK 


3 


2.80 


0.86 


Isoform 1 of Myosin-9 








IAEFTFNLTEEEEK 


13 


2.29 


0.65 


Isoform 1 of Myosin-9 




colon cancer 


ALDH1A1 


GYFVQPTVFSNVTDEMR 


3 


3.18 


0.97 


Retinal dehydrogenase 1 


21435460 




ATP5B 


TVLIMELINNVAK


5 


3.18 


0.88 


ATP synthase subunit beta, 


20080835 



mitochondrial





ETFA 


AAVDAGFVPNDMQVGQTGK 


2 


2.13 


0.98 


Electron transfer flavoprotein subunit 


16708797














alpha, mitochondrial








GTSFDAAATSGGSASSEK 


6 


2.53 


0.86 








ANXA6 


GLGTDEDTIIDIITHR 


2 


2.48 


0.98 


annexin VI isoform 2 


21137014 


Leukemia 


GLUD1 


HGGTIPIVPTAEFQDR 


2 


2.48 


0.98 


Glutamate dehydrogenase 1, 


19683518 














mitochondrial






IDH2 


LNEHFLNTTDFLDTIK 


3 


2.77 


0.98 


Isocitrate dehydrogenase [NADP], 


21205756 














mitochondrial




gastric carcinoma


HIST4H4 


TVTAMDVVYALK


2 


2.03 


0.96 


Histone H4 


19139817 


colorectal cancer 


RRBP1 


TLQEQLENGPNTQLAR 


2 


2.74 


0.88 


Isoform 3 of Ribosome-binding 


19425502














protein 1 




pancreatic cancer 


ARG1 


TGLLSGLDIMEVNPSLGK 


4 


2.71 


0.91 


Isoform 1 of Arginase-1


21347333 




CALM1 


VFDKDGNGYISAAELR 


3 


2.50 


0.93 


Calmodulin 


18852131 






EAFSLFDKDGDGTITTK 


2 


2.62 


0.98 






ovarian cancer 


HAAO 


TQGSVALSVTQDPACK 


2 


2.56 


0.91 


Isoform 1 of 3-hydroxyanthranilate


19724865














3,4-dioxygenase 




Lung cancer 


ACY1 


TVQPKPDYGAAVAFFEETAR 


2 


2.50 


0.99 


cDNA FLJ60317, highly similar to


8394326 














Aminoacylase-1 




cell migration. 


FLNB 


LVSPGSANETSSILVESVTR 


2 


3.21 


0.99 


Isoform 1 of Filamin-B 


19915675 




UGP2 


ILTTASSHEFEHTK 


2 


3.30 


0.93 


Isoform 1 of UTP-glucose-1-
















phosphate uridylyltransferase 








IQRPPEDSIQPYEK 


4 


2.38 


0.95 








Al nHAAl 


cLIrur V Lj V ivirLJLJt\Tt\ 


■D 
J 






Delta-1-pyrroline-5-carboxylate
















dehydrogenase, mitochondrial 






r~ril 1 A A 1 




J 


z.jy 


n 77 


Isoform 1 of Collagen alpha-1 (XIV) 
chain 






UL I Nz 


1 f~ D P\ A A 1 M 1 TP\ D P\/~ A 1 Al/ 

LLbrUAAMNLI UrUCjALA^ 


2 


2.24 


0.94 


dynactin 2 






EEF1B2


rnsr ry\ /I MP\VI APll/ 

brACjLUVLNUYLADK 


3 


2.86 


0.84 


Elongation factor 1-beta 






rnuDD 


A A A C~ 1 P\\ /TC D C D 1 DTN 1 Ul D 1 1 Tl V 

AAAbLUV 1 brtrLr 1 NnrLL 1 


3 


3.11


0.99 


Glyoxylate reductase/hydroxypyruvate 
















reductase 






HSD17B10 


VMTIAPGLFGTPLLTSLPEK 


3 


2.80 


0.91 


Isoform 1 of 3-hydroxyacyl-CoA dehydrogenase














type-2 






PCBD1 


VHITLSTHECAGLSER 


2 


2.54 


0.96 


Pterin-4-alpha-carbinolamine 
















dehydratase 






PDIA6 


GSTAPVGGGAFPTIVER 


3 


2.05 


0.87 


Isoform 2 of Protein disulfide-
















isomerase A6 






PTGR1 


HFVGYPTNSDFELK 


2 


2.24 


0.93 


Prostaglandin reductase 1 








TGPLPPGPPPEIVIYQELR 


7 


2.56 


0.96 








TF 


SAGWNIPIGLLYCDLPEPR 


3 


2.65 


0.97 


Serotransferrin 








EDPQTFYYAVAWK 


A 


2.49 


0.92 








unknown 


PAHVWGDVLQAADVDK 


2 


2.88 


0.96 


22 kDa protein 








HCC and normal sample


Related Cancer


Gene Name


Shotgun Sequence


#(HCC) a / #(normal) b


XCorr


Q Score


Protein Name


PMID c










HCC 


CPS1 


MEYDGILIAGGPGNPALAEPLIQNVR 


2/11 


3.92 


0.91 


carbamoyl-phosphate synthetase 1 


12143053 






SIFSAVLDELK 


1/8 


3.87 


0.92 










IAPSFAVESIEDALK 


3/13 


2.96 


0.85 










TAVDSGIPLLTNFQVTK 


1/10 


2.50 


0.45 








HBA1 


VADALTNAVAHVDDMPNALSALSDLHAHK 


1/8 


3.67 


0.93 


Hemoglobin subunit alpha 1 


20572306 






VGAHAGEYGAEALER 


4/13 


2.05 


0.94 








P4HB 


ILFIFIDSDHTDNQR 


10/15 


2.88 


0.49 


Protein disulfide-isomerase 


21207424 




HNRNPC 


MIAGQVLDINLAAEPK 


21/9 


2.31 


0.46 


Heterogeneous nuclear 


20572306 



ribonucleoprotein C (C1/C2), isoform 
CRA_b 





PGK1 


VSHVSTGGGASLELLEGK 


16/8 


3.36 


0.46 


Phosphoglycerate kinase 1 


19200351 




ACTB 


DLYANTVLSGGTTMYPGIADR 


10/3 


3.25 


0.96 


Actin, cytoplasmic 1 


16493704




GSTA1 


NDGYLMFQQVPMVEIDGMK 


2/6 


2.24 


0.83 


Glutathione S-transferase 


20604928 




FABP1 


SVTELNGDIITNTMTLGDIVFK 


17/6 


3.43 


0.78 


Fatty acid-binding protein 


12245374




CES1 


EGYLQIGANTQAAQK 


13/1 


2.21 


0.76 


Isoform 1 of Liver carboxylesterase 1 


19658107 


Breast Cancer 


LGALS7/ 
LGALS7B 


LVEVGGDVQLDSVR 


19/1 


2.35 


0.65 


Galectin-7/p53-induced gene 1 
protein 


20382700 




HBB 


FFESFGDLSTPDAVMGNPK 


39/67 


2.72 


0.74 


Hemoglobin subunit beta 


20097481 




MDH2 


VDFPQDQLTALTGR 


4/7 


2.49 


0.93 


Malate dehydrogenase 2 


19485423




MYH9 


LQQELDDLLVDLDHQR 


9/15 


2.54 


0.56 


Myosin, heavy polypeptide 9, non- 
muscle, isoform CRA_a 


18796164 


Ovarian cancer 


PSMA2 


YNEDLELEDAIHTAILTLK 


3/5 


4.48 


0.84 


Proteasome subunit alpha type-2 


14960231 


Lung cancer 


AKR1A1 


DPDEPVLLEEPWLALAEK 


3/5 


3.16 


0.63 


Aldo-keto reductase family 1 


17114299 


Chromophobe 
renal cell 


ATP5H 


NLIPFDQMTIEDLNEAFPETK 


3/5 


2.48 


0.95 


ATP synthase subunit d, 
mitochondrial


20440404 


carcinomas 
















Leukemia 


IGKC 


VDNALQSGNSQESVFEQDSK 


3/6 


3.95 


0.92 


Ig kappa chain C region 


12357370 




RPS7 


TLTAVHDAILEDLVFPSEIVGK 


5/3 


3.92 


0.92 


40S ribosomal protein S7 





a The number of spectra in the spectral set from HCC samples.
b The number of spectra in the spectral set from normal samples.
c The PubMed identifier (PMID) in MEDLINE.

Table 4 lists the DEPs found in the HCC sample only, in the normal sample only, and in both samples. In the HCC and normal samples, 57 and 115 reference spectra, respectively, were identified
by SEQUEST. Among these, 29 and 59 peptides, respectively, are known biomarkers for human liver cancer. For the clusters found in both samples, we performed a beta-binomial test
to find DEPs. Only 84 out of 1,571 reference spectra showed heterogeneity of spectral counts between the HCC and normal tissue
samples, and only 22 of these 84 reference spectra were identified by SEQUEST.



As shown in Table 4, many of the peptides are also known to
be associated with cancer. Specifically, EMILIN-1 (EMILIN1),
elongation factor 1-delta (EEF1D), galectin-7/p53-induced
gene 1 protein (LGALS7), hemoglobin subunit beta (HBB) and
malate dehydrogenase 2 (MDH2) are differentially expressed
in breast cancer cells [29-31]. In particular, LGALS7 is
known to be over-expressed relative to control cells, and
it was likewise over-expressed in our data. Table 4 also
lists the cancer types associated with specific genes
[28-34]. Figure 4 shows a scatter plot of the spectral
counts of the normal and HCC samples. The x axis and y axis
represent the number of spectra in the HCC and normal
samples, respectively. Triangles indicate DEPs identified
with the use of SEQUEST, whereas circles indicate
unidentified DEPs. Of the 84 DEPs, 62 were not identified
by SEQUEST despite their significant differences in the
beta-binomial test.

We believe there are several reasons why these 62 DEPs
were not identified by SEQUEST. First, the "one-size-fits-all"
search parameter values of SEQUEST may not have been
appropriate for these peptide targets. Second, the
unidentified DEPs may carry other post-translational
modifications or sequence variations (e.g., from alternative
splicing), or their spectra may contain insufficient
peptide ion information.

We re-ran SEQUEST with several different parameter
options, allowing phosphorylation and two missed
cleavages, and using other sequence databases (NCBI nr
and EST human). Even with these options, SEQUEST did not
identify the remaining 62 DEPs. Next, we tried to identify
the 62 reference spectra using other search engines,
namely MASCOT and SpectraST. MASCOT identified 2 DEPs,
Alcohol dehydrogenase 1A (ADH1A) and Isoform 2 of
Myosin-9 (MYH9), but SpectraST did not identify any.
The remaining 60 DEPs could not be identified by any of
these search engines. Further experiments may be needed
to identify them. For example, additional MS/MS
experiments such as MRM (Multiple Reaction Monitoring) or
SRM (Selected Reaction Monitoring) could be carried out
within the corresponding retention time ranges of the
unidentified spectra in order to collect more detailed
peptide information.

Conclusions 

In this paper, we proposed a novel method to estimate a
peptide's abundance by counting MS/MS spectra clustered
through the direct comparison of all experimentally
observed spectra. For a given pair of spectra, our method
answers the question of whether they come from the same
peptide without searching them against a theoretical
library of protein spectra. By examining all possible
pair-wise comparisons, our method produces a set of
spectra for each peptide and enables us to estimate the
amount of each peptide in the biological samples of
interest by counting the spectra



Lee et al. BMC Bioinformatics 2011, 12:423
http://www.biomedcentral.com/1471-2105/12/423



Figure 4 Scatter plot of spectral counts between normal and HCC samples. This figure plots the number of spectra in the clustered sets in the
HCC and normal samples. The x axis, #(HCC), and the y axis represent the number of spectra in the HCC and normal samples, respectively.
The grey triangles indicate DEPs identified with the use of SEQUEST, whereas the black circles indicate unidentified DEPs.



in each cluster. Since our proposed method compares all
possible pairs of experimental spectra, it can discover
even modified and unknown peptides, which may not be
searchable in a theoretical spectral library. In practical
MS/MS experimental data, a large proportion of spectra
are often misidentified or completely lost during a
computational database search. Q-FISH, on the other hand,
retains these spectra without any loss of information.
As demonstrated in our practical examples, the majority
of DEPs derived by Q-FISH were found to be closely related
to various cancers, and these DEPs were not discovered by
other methods.

We thus believe that our Q-FISH algorithm will be highly
useful in the identification of novel peptides [19].
Q-FISH also has the potential to find applications in many
other practical proteomic studies. For example, it can be
used to discover unknown biomarkers or drug targets by
quantifying sets of identical peptides in multiple samples
and comparing proteins that show statistically significant
differences. Unknown spectral clusters can often come
from non-peptide contaminants, as revealed by a recent
publication [35]. Q-FISH can evaluate the significance of
such unknown clusters, some of which may be novel
biomarkers that require further experimental confirmation
by de novo sequencing, unrestricted sequence database
search (using e.g. InsPecT [36]) or spectral library search
(using e.g. pMatch [37]).

Methods 

Sample Preparation, Nano-LC-ESI-MS/MS 

Tissue samples, namely hepatocellular carcinoma (HCC)
tumour tissue and adjacent healthy liver tissue, were
collected under the guidelines of the Institutional Review
Board (IRB) established at Yonsei Medical Center (Seoul,
Korea). All tissues were prepared and in-solution tryptic
digestion was subsequently performed as previously
described [20]. Nano-LC-MS/MS analysis was performed
on an Agilent Nano HPLC 1100 system coupled to a linear
trap quadrupole (LTQ) mass spectrometer (Thermo
Electron, San Jose, US). LC-MS/MS was performed as
previously described [38]. Peptide fractionation was
performed by strong cation exchange (SCX) chromatography
at a flow rate of 0.5 mL/min, and the absorbance of the
column effluent was monitored at 280 nm for 40 min.
Fractions were automatically transferred every 0.5 min
into a 96-well microplate.

Nano-LC-MS/MS experiments were carried out three
times on each of the two samples (human liver cancer and
normal tissue), generating 44,318 MS/MS spectra. These
tandem mass spectrometry data were first analyzed with
the database search software SEQUEST (Bioworks 3.2,
ThermoFinnigan, San Jose, US). The sequence database was
the International Protein Index (IPI) human version 3.61,
downloaded from the European Bioinformatics Institute
(EBI), and it was combined with its reversed sequences.
The maximum number of missed cleavage sites was set to 1,
and only tryptic cleavage after arginine and lysine was
allowed. The mass tolerance of the precursor peptide ion
was set to 3.0 Da, and the fragment ion tolerance was set
to 0.5 Da. These tolerance values were chosen to minimize
the FDR when XCorr > 1.5 [39]. Carboxyamidomethylation of
cysteine and oxidation of methionine were allowed as
modifications [40]. All peptides assigned to reversed
sequences were removed before proceeding to peptide
identification in order to limit false-positive
identifications. We chose XCorr thresholds of 1.44 (+1),
1.97 (+2) and 3.13 (+3), each of which yielded an FDR
close to 0.05, and required DeltaCn to be greater than or
equal to 0.1. These score criteria were considered to
ensure high confidence in the results of protein
identification [41].

The spectra were also analyzed with the spectral library
search software SpectraST, which was initially developed
by the Institute for Systems Biology (ISB) and the
National Institute of Standards and Technology (NIST).
SpectraST is integrated with the Trans-Proteomic Pipeline
(TPP) software suite, which provides the supporting
functionality needed in a full proteomics data analysis
pipeline. SpectraST was run against the NIST Human IT
Library, and matches with SpectraST scores > 0.9 were
accepted [18,38,42]. The precursor tolerance was set to
1.5 Da/z (Thomson).
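
The score thresholds above rest on a target-decoy (reversed database) search. The text does not spell out the exact FDR estimator, so the following is only a minimal sketch that assumes one common convention: the FDR at a score cut-off is the number of decoy hits divided by the number of target hits scoring at or above that cut-off. The function names and the toy scores are hypothetical.

```python
def decoy_fdr(scores, is_decoy, threshold):
    """Estimate FDR as (#decoy hits) / (#target hits) at or above the threshold."""
    targets = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and not d)
    decoys = sum(1 for s, d in zip(scores, is_decoy) if s >= threshold and d)
    return decoys / targets if targets else 0.0


def lowest_threshold(scores, is_decoy, max_fdr=0.05):
    """Return the smallest observed score whose estimated FDR is <= max_fdr."""
    for candidate in sorted(set(scores)):
        if decoy_fdr(scores, is_decoy, candidate) <= max_fdr:
            return candidate
    return None


# Toy XCorr-like scores; True marks a hit to a reversed (decoy) sequence.
toy_scores = [3.1, 2.9, 2.5, 2.2, 2.0, 1.8, 1.6]
toy_decoy = [False, False, False, False, True, False, True]
chosen = lowest_threshold(toy_scores, toy_decoy)
print(chosen)
```

In practice such a search would be run separately for each precursor charge state, which is how charge-specific thresholds like those quoted above can arise.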

Q-FISH algorithm for direct comparison of experimental 
spectra 

We assumed that MS/MS spectra from the same pep- 
tide would present similar patterns. Under this assump- 
tion, the proposed Q-FISH algorithm can be applied to 
find DEPs both in normal and disease samples. As 
shown in Figure 1, to evaluate the similarities between 
two spectra, we use a correlation coefficient of the mov- 
ing window averages. The analytical process is summar- 
ized as follows: 

1. Scale Standardization 

Perform scale standardization by dividing the intensity 
values by its maximum value. 



2. Moving average 

Compute the moving window average over the spectra 
using a window of fixed size. 

3. Correlation index for moving average-based peak 
patterns 

Calculate a summary statistic based on the correlation 
coefficient of the moving averages between two spectra. 

4. Spectral count-based quantification using two-stage 
clustering 

Cluster duplicated peptides with similar peak patterns 
and retention time using a two-stage clustering method. 

5. Identification of differentially expressed peptides

Employ the beta-binomial test to identify DEPs among
the experimental groups.

Similarity measure between pairs of MS/MS spectra

Scale standardization

Because the intensities of the spectra may differ for
various physical and chemical reasons, such as
inconsistencies in the total ion currents, we cannot use
the raw intensities of the m/z peaks. We therefore used a
scale-standardization method, which divides the m/z peak
intensities of all ions by their maximum value. Let x[i]
be the intensity of the i-th m/z peak. Then the
scale-standardized intensity, y[i], is defined by

$$ y[i] = \frac{x[i]}{\max_{k} x[k]} $$
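
As a small illustration of the scale standardization, the sketch below divides each intensity by the spectrum maximum. The function name `scale_standardize` and the toy intensities are illustrative assumptions, not part of the original software.

```python
def scale_standardize(intensities):
    """Divide every peak intensity by the largest intensity in the spectrum."""
    peak_max = max(intensities)
    if peak_max <= 0:
        raise ValueError("spectrum has no positive intensity")
    return [x / peak_max for x in intensities]


raw = [120.0, 30.0, 480.0, 0.0, 240.0]
scaled = scale_standardize(raw)
print(scaled)  # [0.25, 0.0625, 1.0, 0.0, 0.5]
```

After this step the base peak of every spectrum equals 1, so spectra acquired at different total ion currents become comparable in shape.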
Moving window average 

To reduce the background noise in the peak intensities,
the moving window average (MWA) is used. The simplest
moving average is the unweighted (or uniformly weighted)
average of the n data points within a given window,
whereas the weighted moving window average (WMWA)
multiplies each data point by a weight factor before
averaging. Among the various options for the WMWA
weights, we selected the "Gaussian" kernel, which uses the
probability density function (pdf) of the standard
Gaussian distribution with mean 0 and variance 1 as the
weight function.

For a given spectrum, the MWA is calculated by averaging
the peak intensities within the sliding window
sequentially over all m/z peaks. In other words, the MWA
is not a single value but a sequence of averages. The
next step is to calculate the correlation between the
MWAs of two spectra and determine whether the two spectra
come from the same peptide.

We assume that there are N moving windows of fixed
size K along the entire m/z range. The WMWA for the
i-th moving window (i = 1, 2, ..., N) is defined by

$$ m[i] = \sum_{j=0}^{K-1} w_j \, y[i+j] $$

where y[i+j] is the j-th scale-standardized intensity in
the i-th moving window and $w_j$ are the weights. For the
uniform kernel, $w_j = 1/K$. For the Gaussian kernel,
$w_j = \phi(z_j)$, where $\phi$ is the pdf of the standard
Gaussian distribution and $z_j$ is the value for y[i+j]
standardized using the mean and variance of the m/z
values in the i-th window. The total number of windows,
N, is determined by the fixed window size K and the m/z
range (20-2000 Da). To determine the optimal window size,
we randomly selected pairs of spectra from the same and
from different peptides using a target-decoy sequence
database, and carried out a receiver operating
characteristic (ROC) analysis. Based on the ROC analysis,
we chose a window size of K = 30 (3.0 Da) and accordingly
N = 19,771 (20-2000 Da at intervals of 0.1 Da). However,
the areas under the curve (AUC) differed little across
window sizes, so the results were not very sensitive to
this choice.
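
The moving window average can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the spectrum is already binned on a regular m/z grid, that $z_j$ standardizes the position j within the window, and that the Gaussian weights are rescaled to sum to 1 (a normalization the text does not state). The function names are hypothetical.

```python
import math


def gaussian_weights(k):
    """Standard normal pdf at standardized window positions, rescaled to sum to 1.

    Assumption: z_j standardizes position j (0..k-1) by the mean and the
    population variance of the positions in the window.
    """
    mean = (k - 1) / 2
    sd = math.sqrt(sum((j - mean) ** 2 for j in range(k)) / k)
    pdf = [math.exp(-0.5 * ((j - mean) / sd) ** 2) / math.sqrt(2 * math.pi)
           for j in range(k)]
    total = sum(pdf)
    return [p / total for p in pdf]


def moving_window_average(y, k, kernel="uniform"):
    """Return m[i] = sum_j w_j * y[i + j] for every full window of size k."""
    if k < 2 or k > len(y):
        raise ValueError("window size must be between 2 and len(y)")
    if kernel == "uniform":
        weights = [1.0 / k] * k
    elif kernel == "gaussian":
        weights = gaussian_weights(k)
    else:
        raise ValueError("unknown kernel")
    return [sum(w * y[i + j] for j, w in enumerate(weights))
            for i in range(len(y) - k + 1)]


print(moving_window_average([0.0, 3.0, 6.0, 9.0], 2))  # [1.5, 4.5, 7.5]
# 19,800 bins of 0.1 Da with K = 30 give N = 19,771 windows.
print(len(moving_window_average([0.0] * 19800, 30)))   # 19771
```

The second call reproduces the window count quoted in the text under the assumption that the 20-2000 Da range is split into 19,800 bins of 0.1 Da.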
Correlation index for moving average-based peak patterns

For the spectra of peptides p and q, the correlation
coefficient is computed as follows:

$$ r_{pq} = \frac{\sum_{i=1}^{N} (m_p[i] - \bar{m}_p)(m_q[i] - \bar{m}_q)}{\sqrt{\sum_{i=1}^{N} (m_p[i] - \bar{m}_p)^2}\,\sqrt{\sum_{i=1}^{N} (m_q[i] - \bar{m}_q)^2}} $$

where $\bar{m}_p$ and $\bar{m}_q$ are the means of the moving
window averages for peptides p and q. The closer the
correlation coefficient is to 1, the more similar the two
peak patterns are, and the more likely it is that the two
spectra come from the same peptide.
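
The correlation index is the ordinary Pearson coefficient applied to two MWA sequences of equal length. A minimal sketch, with a hypothetical function name and toy profiles:

```python
import math


def mwa_correlation(m_p, m_q):
    """Pearson correlation between two moving-window-average profiles."""
    if len(m_p) != len(m_q):
        raise ValueError("profiles must have the same length")
    n = len(m_p)
    mean_p = sum(m_p) / n
    mean_q = sum(m_q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(m_p, m_q))
    ss_p = sum((a - mean_p) ** 2 for a in m_p)
    ss_q = sum((b - mean_q) ** 2 for b in m_q)
    if ss_p == 0 or ss_q == 0:
        return 0.0  # a flat profile carries no pattern to correlate
    return cov / math.sqrt(ss_p * ss_q)


same_shape = mwa_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
mirrored = mwa_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(same_shape, mirrored)
```

Because the coefficient is invariant to scaling, two spectra with the same peak pattern but different overall intensity still score close to 1.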

Quantification by counting spectra in clustered spectra 
set from a homogenous peptide 

Two-stage cluster analysis is used to group spectra with
similar patterns into peptide sets. As assumed above,
spectra with approximately the same shape are taken to
come from the same peptide, so each cluster is expected
to be composed of the spectra obtained from a homogenous
peptide. The two-stage clustering analysis employs two
similarity measures: the first is the difference between
precursor ions, and the second is the correlation
coefficient between two MWAs. MS/MS spectra obtained
from the same peptide are expected to have similar
precursor ion masses. In the first stage, clusters are
therefore defined in terms of pair-wise differences
between the precursor ions: for any pair of precursor
ions in the same cluster, the difference is smaller than
a threshold value. In our analysis, we set the threshold
to ± 1 Da. In the second stage, a hierarchical clustering
analysis is performed within each of the first-stage
clusters. Specifically, we employ "single linkage," also
known as the nearest neighbour technique, with the
correlation coefficient of the MWAs as the similarity
measure.

Because this two-stage clustering analysis yields
clustered spectral sets consisting of MS/MS spectra from
the same peptide, the amount of each peptide can be
quantified by counting the spectra in its clustered set.
Lastly, a representative spectrum, called the "reference
spectrum," can be defined for each spectral set as the
average spectrum of that set.
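
The two stages can be sketched as below. This is only an illustration under stated assumptions: stage one uses a greedy split of the sorted precursor masses so that every pair in a group differs by less than the tolerance, and stage two cuts a single-linkage hierarchy at a fixed correlation cut-off, which is equivalent to taking connected components of the "correlation at least cut-off" graph. All names and the toy data are hypothetical.

```python
import math


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b) if ss_a and ss_b else 0.0


def stage_one(precursors, tol=1.0):
    """Group spectrum indices so all precursor pairs in a group differ by < tol."""
    order = sorted(range(len(precursors)), key=lambda i: precursors[i])
    groups, current = [], []
    for i in order:
        if current and precursors[i] - precursors[current[0]] >= tol:
            groups.append(current)
            current = []
        current.append(i)
    if current:
        groups.append(current)
    return groups


def stage_two(group, profiles, cutoff=0.6):
    """Single linkage cut at `cutoff`: merge any pair with correlation >= cutoff."""
    parent = {i: i for i in group}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for pos, i in enumerate(group):
        for j in group[pos + 1:]:
            if pearson(profiles[i], profiles[j]) >= cutoff:
                parent[find(i)] = find(j)
    clusters = {}
    for i in group:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def two_stage_clusters(precursors, profiles, tol=1.0, cutoff=0.6):
    return [sorted(c) for g in stage_one(precursors, tol)
            for c in stage_two(g, profiles, cutoff)]


precursors = [500.2, 500.6, 500.4, 742.9]
profiles = [
    [0.0, 1.0, 0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0, 0.0],
]
clusters = two_stage_clusters(precursors, profiles)
print(clusters)                    # [[0, 1], [2], [3]]
print([len(c) for c in clusters])  # spectral counts: [2, 1, 1]
```

Spectra 0 and 3 have identical profiles but end up in different clusters because their precursor masses differ by far more than 1 Da, which is the point of the first stage.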

Validation of the clustering results using retention times 

It is well known that spectra of the same peptide tend to
elute within a limited liquid chromatography (LC)
interval. Thus, the clustering results can be validated
using the retention time (RT) information.

To validate the clustering results, we propose a new
measure that estimates the clustering error rate from the
spectral RT information. Note that Q-FISH returns a list
of clusters. If a cluster contains only spectra from the
same peptide, the RTs of its spectra will have similar
values. If a cluster contains spectra from different
peptides, the RTs will differ. To measure this, we
consider statistics that represent the variability of the
RTs within a cluster, such as the coefficient of
variation (CV) and the standard deviation (SD). Since the
RT varies widely across spectra, the CV is a better
measure than the SD. Using the CV, we propose a new
measure called the false clustering rate (FCR), which is
similar in spirit to the false discovery rate (FDR). It
measures how often a cluster is composed of spectra from
different peptides. We use a threshold value of the CV,
A, to determine whether a cluster is well clustered: if
the CV of a given cluster is smaller than A, we call it a
good cluster. For a given value of A, the FCR is computed
as follows:

1) Calculate the coefficient of variation (CV) of the
spectral RTs within each cluster from the Q-FISH
results.

2) Permute the spectra while keeping the number of
spectra in each cluster fixed.

3) Calculate $CV_p$ for each permuted cluster in the
p-th permuted sample.

4) Compute FCR as follows:

$$ \mathrm{FCR} = \frac{\frac{1}{P}\sum_{p=1}^{P} \#\{i \mid CV_p(i) < A\}}{\#\{i \mid CV(i) < A\}}, \quad i = 1, 2, \ldots, C, $$

where P is the number of permutations, A the threshold
value, and C the total number of clusters.
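
The permutation procedure above can be sketched as follows. This is a minimal illustration: it assumes the CV is the population standard deviation divided by the mean, expressed as a fraction rather than a percentage, and the function names and toy retention times are hypothetical.

```python
import random
import statistics


def coefficient_of_variation(values):
    """Population SD divided by the mean (assumes positive retention times)."""
    return statistics.pstdev(values) / statistics.mean(values)


def false_clustering_rate(clusters, threshold, n_perm=200, seed=0):
    """clusters: list of lists of retention times, one list per cluster."""
    rng = random.Random(seed)
    sizes = [len(c) for c in clusters]
    observed = sum(coefficient_of_variation(c) < threshold for c in clusters)
    if observed == 0:
        return float("nan")
    pooled = [rt for c in clusters for rt in c]
    null_total = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign spectra to clusters, sizes kept fixed
        start = 0
        for size in sizes:
            chunk = pooled[start:start + size]
            start += size
            null_total += coefficient_of_variation(chunk) < threshold
    return (null_total / n_perm) / observed


rt_clusters = [
    [10.0, 10.1, 10.2],
    [30.0, 30.1, 29.9],
    [55.0, 55.2, 54.9],
    [80.0, 80.1, 80.3],
]
fcr = false_clustering_rate(rt_clusters, threshold=0.05)
print(round(fcr, 3))
```

With a very large threshold every observed and permuted cluster counts as "good", so the ratio tends to 1; tightening the threshold lowers the FCR when the real clusters are homogeneous in retention time.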
Table 5 Validation for clustering result using the false
clustering rate (FCR)

FCR using RT information      FCR for the cut-off value

A        FCR                  p        FCR
1        0.0288               0.0      1.0000
2        0.0307               0.1      0.9486
3        0.0380               0.2      0.8060
4        0.0467               0.3      0.6525
4.4      0.0500               0.4      0.4515
5        0.0553               0.5      0.3178
6        0.0639               0.6      0.0251
7        0.0719               0.7      0.0034
8        0.0806               0.8      0.0008
9        0.0895               0.9      0.0003
10       0.0981               1.0      0.0000

In order to validate the clustering results, we propose a new measure to
estimate the clustering error rate using the spectral retention time (RT)
information. We computed the false clustering rate (FCR) for various values of
threshold A, as summarized. We also calculated FCR to determine the cut-off
value of correlation coefficient for spectral clustering. We computed FCR for
the various values of the given p, as summarized. We chose p = 0.6 which
yielded FCR close to 0.05.

For our HCC data, we computed the FCR for various
values of A, as summarized in Table 5. From this
analysis, we chose A = 4.4, which yielded an FCR close
to 0.05.

We also calculated the FCR to determine the cut-off value
of the correlation coefficient, p, for spectral
clustering. For a given threshold value of p, the FCR can
be computed in the same manner as for A. We computed the
FCR for various values of p, as summarized in Table 5,
and chose p = 0.6, which yielded an FCR close to 0.05.

Differentially expressed peptides (DEPs) 

To estimate a peptide's abundance in different samples,
such as control and disease tissue samples, a spectral
counting method like Q-FISH can be employed. Pham et al.
[21] proposed the use of the beta-binomial distribution
to test the significance of differences in spectral
counts in label-free mass spectrometry-based proteomics.
Their results showed that the beta-binomial test can be
applied to experiments with one or more replicates, as
well as to the comparison of multiple conditions. We
applied the beta-binomial model to test for differences
in abundance in the clustered spectral sets across three
replicated MS/MS experiments.

Let x denote the number of spectral counts in the
clustered spectral set and n the total number of spectral
counts of all spectra in each sample. Assume that, given
the true proportion $\pi$, $0 < \pi < 1$, x follows a
binomial distribution,

$$ x \mid \pi \sim \mathrm{Binomial}(n, \pi). $$

In turn, $\pi$ is modeled as a random variable following
the beta distribution with real parameters $\alpha > 0$
and $\beta > 0$,

$$ \pi \sim \mathrm{Beta}(\alpha, \beta), \qquad E(\pi) = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{Var}(\pi) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}. $$

The marginal distribution of x is then the beta-binomial
distribution [21],

$$ p(x \mid \alpha, \beta, n) = \int_0^1 p(x \mid \pi, n)\, p(\pi \mid \alpha, \beta)\, d\pi = \binom{n}{x} \int_0^1 \frac{\pi^{x+\alpha-1}(1-\pi)^{n-x+\beta-1}}{B(\alpha,\beta)}\, d\pi = \binom{n}{x} \frac{B(\alpha+x,\, n+\beta-x)}{B(\alpha,\beta)}, $$

where $B(\cdot,\cdot)$ is the beta function.

The following parameterization is used:

$$ \pi = \frac{\alpha}{\alpha+\beta} = h(Xb) = h(\eta) \quad \text{and} \quad \phi = \frac{1}{\alpha+\beta+1}, $$

where h is the inverse of the link function (logit or
complementary log-log), X a design matrix, b a vector of
fixed effects, $\eta = Xb$ the linear predictor, and $\phi$
the overdispersion parameter. Based on this
parameterization, the marginal mean and variance are:

$$ E(x) = n\pi, \qquad \mathrm{Var}(x) = n\pi(1-\pi)\,[1 + (n-1)\phi]. $$
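
As a numerical check of the marginal mean and variance, the sketch below evaluates the beta-binomial probability mass function through log-gamma functions and compares its moments with the closed forms. The function names and the parameter values (n = 20, alpha = 2, beta = 5) are arbitrary choices for illustration only.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(x, n, alpha, beta):
    """C(n, x) * B(alpha + x, n + beta - x) / B(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return math.exp(log_choose
                    + log_beta(alpha + x, beta + n - x)
                    - log_beta(alpha, beta))


def closed_form_moments(n, alpha, beta):
    """Mean n*pi and variance n*pi*(1-pi)*[1 + (n-1)*phi]."""
    pi = alpha / (alpha + beta)
    phi = 1.0 / (alpha + beta + 1.0)
    return n * pi, n * pi * (1 - pi) * (1 + (n - 1) * phi)


n, alpha, beta = 20, 2.0, 5.0
pmf = [beta_binomial_pmf(x, n, alpha, beta) for x in range(n + 1)]
mean_num = sum(x * p for x, p in enumerate(pmf))
var_num = sum((x - mean_num) ** 2 * p for x, p in enumerate(pmf))
mean_cf, var_cf = closed_form_moments(n, alpha, beta)
print(round(mean_num, 4), round(mean_cf, 4))
print(round(var_num, 4), round(var_cf, 4))
```

The factor 1 + (n - 1)φ is what separates this model from a plain binomial: with φ = 0 the variance reduces to nπ(1 - π), and larger φ allows for extra replicate-to-replicate variability in spectral counts.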

The parameters b and φ are estimated by maximizing the
log-likelihood of the marginal model. Given the
estimated coefficients, the test of differential
expression is rephrased as a test of whether the
corresponding coefficient in b is 0 [43]. We also used
the Benjamini and Hochberg method to correct for
multiple testing of DEPs [44].
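
The Benjamini and Hochberg step-up adjustment can be sketched in a few lines. The function name and the toy p-values are illustrative; the adjusted value for the p-value of rank r among m tests is the running minimum, taken from the largest rank downward, of p * m / r.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(toy_p)
print([round(q, 4) for q in adjusted_p])  # [0.04, 0.0533, 0.0533, 0.2]
```

A cluster would then be called differentially expressed when its adjusted p-value falls below the chosen FDR level.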

Additional material 



Additional file 1: Lists for identified peptides reported in the
literature. In order to compare the performance of Q-FISH with the
spectral counting method by SEQUEST, we used the human liver data
and validated the results through a literature search. For the human liver
data, Q-FISH provided 1571 differentially expressed clusters for the HCC
sample and 1556 for the normal sample, among which 57 and 99 clusters
were identified by SEQUEST in the HCC and normal samples, respectively. On
the other hand, SEQUEST provided 93 and 145 peptides for the HCC and
normal tissue samples, respectively.






Acknowledgements 

The work of TP was supported by the National Research Foundation (KRF-2008-313-C00086) and the Brain Korea 21 Project of the Ministry of
Education. The work of JKL was supported in part by the US National
Institutes of Health (R01HL081690).

Author details 

1 Department of Statistics, Seoul National University, Korea. 2 Interdisciplinary
Program in Bioinformatics, Seoul National University, Korea. 3 Department of
Biochemistry, Yonsei Proteome Research Center and Biomedical Proteome
Research Center, Yonsei University, Korea. 4 Center for Genomics and
Bioinformatics, Indiana University, USA. 5 Department of Public Health
Sciences, Division of Biostatistics, University of Virginia School of Medicine,
USA.

Authors' contributions 

SML and MSK performed the statistical analysis and drafted the manuscript. 
HJL, YKP, and HT carried out mass spectrometry experiments. JKL and TP 
conceived of the study and participated in its coordination. All authors wrote,
read, and approved the final manuscript.

Competing interests 

The authors declare that they have no competing interests. 

Received: 7 February 2011 Accepted: 28 October 2011
Published: 28 October 2011

References 

1. Hernandez P, Markus M, Appel RD: Automated protein identification by
tandem mass spectrometry: issue and strategies. Mass Spectrometry 
Reviews 2006, 25:235-254. 

2. Nesvizhskii AI, Vitek O, Aebersold R: Analysis and validation of proteomic
data generated by tandem mass spectrometry. Nature Methods 2007, 
4:787-797. 

3. Washburn MP, Ulaszek RB, Yates JR: Reproducibility of quantitative 
proteomic analyses of complex biological mixtures by multidimensional 
protein identification technology. Anal Chem 2003, 75:5054-5061.

4. Ong SE, Mann M: Mass spectrometry-based proteomics turns 
quantitative. Nat Chem Biol 2005, 1:252-262. 

5. Wang M, You J, Bemis KG, Tegeler TJ, Brown DP: Label-free mass 
spectrometry-based protein quantification technologies in proteomic 
analysis. Briefings in Functional Genomics and Proteomics 2008, 7(5):329-339.

6. Old WM, Meyer-Arendt K, Aveline-Wolf L, Pierce KG, Mendoza A, 
Sevinsky JR, Resing KA, Ahn NG: Comparison of label-free methods for 
quantifying human proteins by shotgun proteomics. Mol Cell Proteomics 
2005, 4(10):1487-1502. 

7. Little KM, Lee JK, Ley K: ReSASC: a resampling-based algorithm to 
determine differential protein expression from spectral count data. 
Proteomics 2010, 10:1212-1222. 

8. Jaffe JD, Mani DR, Leptos KC, Church GM, Gillette MA, Carr SA: PEPPeR, a 
platform for experimental proteomic pattern recognition. Mol Cell 
Proteomics 2006, 5:1927-1941. 

9. Li XJ, Yi EC, Kemp CJ, Zhang H, Aebersold R: A software suite for the 
generation and comparison of peptide arrays from sets of data 
collected by liquid chromatography-mass spectrometry. Mol Cell 
Proteomics 2005, 4:1328-1340. 

10. Breukelen B, Toorn HW, Drugan MM, Heck AJ: StatQuant: a post-quantification
analysis toolbox for improving quantitative mass spectrometry.
Bioinformatics 2009, 25:1472-1473.

11. Mann B, Madera M, Sheng Q, Tang H, Mechref Y, Novotny MV:
ProteinQuant Suite: a bundle of automated software tools for label-free
quantitative proteomics. Rapid Commun Mass Spectrom 2008,
22:3823-3834.

12. Zhang H, Yi EC, Li XJ, Mallick P, Kelly-Spratt KS, Masselon CD, Camp DG, 
Smith RD, Kemp CJ, Aebersold R: High throughput quantitative analysis of 
serum proteins using glycopeptide capture and liquid chromatography 
mass spectrometry. Mol Cell Proteomics 2005, 4:144-155. 

13. Radulovic D, Jelveh S, Ryu S, Hamilton TG, Foss E, Mao Y, Emili A: 
Informatics platform for global proteomic profiling and biomarker 
discovery using liquid chromatography-tandem mass spectrometry. Mol 
Cell Proteomics 2004, 3:984-997. 



14. Prakash A, Mallick P, Whiteaker J, Zhang H, Paulovich A, Flory M, Lee H, 
Aebersold R, Schwikowski B: Signal maps for mass spectrometry-based 
comparative proteomics. Mol Cell Proteomics 2006, 5:423-432. 

15. Fischer B, Grossmann J, Roth V, Gruissem W, Baginsky S, Buhmann JM: 
Semi-supervised LC/MS alignment for differential proteomics.
Bioinformatics 2006, 22:e132-e140.

16. Kapp E, Schutz F: Overview of tandem mass spectrometry (MS/MS) 
database search algorithms. Current Protocols in Protein Science 2007, 
25:25.2.1-25.2.19. 

17. Nesvizhskii A, Roos FF, Grossmann J, Vogelzang M, Eddes JS, Gruissem W, 
Baginsky S, Aebersold R: Dynamic spectrum quality assessment and 
iterative computational analysis of shotgun proteomic data: toward more 
efficient identification of post-translational modifications, sequence 
polymorphisms, and novel peptides. Mol Cell Proteomics 2006, 5:652-670. 

18. Nesvizhskii A: A survey of computational methods and error rate 
estimation procedures for peptide and protein identification in shotgun 
proteomics. J of Proteomics 2010, 73:2092-2123. 

19. Lam H, Deutsch E, Eddes J, Eng JK, King N, Stein SE, Aebersold R: 
Development and validation of a spectral library searching method for 
peptide identification from MS/MS. Proteomics 2007, 7:655-667. 

20. Beer I, Barnea E, Ziv T, Admon A: Improving large-scale proteomics by 
clustering of mass spectrometry data. Proteomics 2004, 4(4):950-960.

21. Pham TV, Piersma SR, Warmoes M, Jimenez CR: On the beta-binomial 
model for analysis of spectral count data in label-free tandem mass 
spectrometry-based proteomics. Bioinformatics 2010, 26(3):363-369.

22. Seriramalu R, Pang WW, Jayapalan JJ, Mohamed E, Abdul-Rahman PS, 
Bustam AZ, Khoo AS, Hashim OH: Application of champedak mannose- 
binding lectin in the glycoproteomic profiling of serum samples unmasks 
reduced expression of alpha-2 macroglobulin and complement factor B in 
patients with nasopharyngeal carcinoma. Electrophoresis 2010, 31:2388-2395. 

23. Kinoshita M, Miyata M: Underexpression of mRNA in human 
hepatocellular carcinoma focusing on eight loci. Hepatology 2002, 
36(2):433-438.

24. Fu LY, Jia HL, Dong QZ, Wu JC, Zhao Y, Zhou HJ, Ren N, Ye QH, Qin LX: 
Suitable reference genes for real-time PCR in human HBV-related 
hepatocellular carcinoma with different clinical prognoses. BMC Cancer
2009, 9:49.

25. Hwang TL, Liang Y, Chien KY, Yu JS: Overexpression and elevated serum 
levels of phosphoglycerate kinase 1 in pancreatic ductal 
adenocarcinoma. Proteomics 2006, 6(7):2259-2272.

26. Li Y, Wan D, Wei W, Su J, Cao J, Qiu X, Ou C, Ban K, Yang C, Yue H: 
Candidate genes responsible for human hepatocellular carcinoma 
identified from differentially expressed genes in hepatocarcinogenesis 
of the tree shrew (Tupaia belangeri chinensis). Hepatol Res 2008,
38(1):85-89. 

27. Elchuri S, Naeemuddin M, Sharpe O, Robinson WH, Huang TT: Identification 
of biomarkers associated with the development of hepatocellular 
carcinoma in CuZn superoxide dismutase deficient mice. Proteomics 
2007, 7(12):2121-2129.

28. Na K, Lee EY, Lee HJ, Kim KY, Lee H, Jeong SK, Jeong AS, Cho SY, Kim SA, 
Song SY, Kim KS, Cho SW, Kim H, Paik YK: Human plasma carboxylesterase 
1, a novel serologic biomarker candidate for hepatocellular carcinoma. 
Proteomics 2009, 9:3989-3999.

29. Demers M, Rose AA, Grosset AA, Biron-Pain K, Gaboury L, Siegel PM, St- 
Pierre Y: Overexpression of galectin-7, a myoepithelial cell marker, 
enhances spontaneous metastasis of breast cancer cells. Am J Pathol
2010, 176(6):3023-3031.

30. Schulz DM, Böllner C, Thomas G, Atkinson M, Esposito I, Höfler H, Aubele M:
Identification of differentially expressed proteins in triple-negative
breast carcinomas using DIGE and mass spectrometry. J Proteome Res
2009, 8(7):3430-3438.

31. Pau Ni IB, Zakaria Z, Muhammad R, Abdullah N, Ibrahim N, Aina Emran N, 
Hisham Abdullah N, Syed Hussain SN: Gene expression patterns 
distinguish breast carcinomas from normal breast tissues: the Malaysian 
context. Pathol Res Pract 2010, 206(4):223-228.

32. Yue W, Sun LY, Li CH, Zhang LX, Pei XT: Screening and identification of 
ovarian carcinomas related genes. Ai Zheng 2004, 23(2):141-145. 

33. Hirata T, Yamamoto H, Taniguchi H, Horiuchi S, Oki M, Adachi Y, Imai K, 
Shinomura Y: Characterization of the immune escape phenotype of 
human gastric cancers with and without high-frequency microsatellite 
instability. J Pathol 2007, 211(5):516-523.






34. Yusenko MV, Ruppert T, Kovacs G: Analysis of differentially expressed 
mitochondrial proteins in chromophobe renal cell carcinomas and renal 
oncocytomas by 2-D gel electrophoresis. Int J Biol Sci 2010, 6(3):213-224.

35. Fu Y, Xiu LY, Jia W, Ye D, Sun RX, Qian XH, He SM: DeltAMT: A Statistical 
Algorithm for Fast Detection of Protein Modifications From LC-MS/MS 
Data. Mol Cell Proteomics 2011, 10(5):M110.000455.

36. Tanner S, Shu H, Frank A, Wang LC, Zandi E, Mumby M, Pevzner PA, 
Bafna V: InsPecT: identification of posttranslationally modified peptides 
from tandem mass spectra. Anal Chem 2005, 77(14):4626-39.

37. Ye D, Fu Y, Sun RX, Wang HP, Yuan ZF, Chi H, He SM: Open MS/MS 
spectral library search to identify unanticipated post-translational 
modifications and increase spectral identification rate. Bioinformatics 
2010, 26(12):i399-406. 

38. Lee HJ, Kang MJ, Lee EY, Cho SY, Kim H, Paik YK: Application of a peptide- 
based PF2D platform for quantitative proteomics in disease biomarker 
discovery. Proteomics 2008, 8(16):3371-3381.

39. Kapp EA, Schutz F, Connolly LM, Chakel JA, Meza JE, Miller CA, Fenyo D, 
Eng JK, Adkins JN, Omenn GS, Simpson RJ: An evaluation, comparison, 
and accurate benchmarking of several publicly available MS/MS search 
algorithms: sensitivity and specificity analysis. Proteomics 2005, 
5(13):3475-3490. 

40. Lee HJ, Na K, Kwon MS, Park T, Kim KS, Kim H, Paik YK: A new versatile 
peptide-based size exclusion chromatography platform for global 
profiling and quantitation of candidate biomarkers in hepatocellular 
carcinoma specimens. Proteomics 2011, 11:1976-1984. 

41. Malcolm SB, Peter K: Mass spectral compatibility of four proteomics 
stains. Journal of Proteome Research 2007, 6:4313-4320. 

42. Lam H, Aebersold R: Using spectral libraries for peptide identification from
tandem mass spectrometry (MS/MS) data. Curr Protoc Protein Sci 2010, 
60:25.5.1-25.5.9. 

43. Skellam JG: A probability distribution derived from the binomial 
distribution by regarding the probability of success as variable between 
the sets of trials. J R Stat Soc Ser B (Methodol) 1948, 10:257-261.

44. Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical 
and powerful approach to multiple testing. J Roy Statist Soc Ser B 1995, 
57:289-300. 



doi:10.1186/1471-2105-12-423

Cite this article as: Lee et al.: Enhanced peptide quantification using
spectral count clustering and cluster abundance. BMC Bioinformatics
2011 12:423.






